NSF PAR Search | NSF Public Access Repository

Transformers Provably Learn Two-Mixture of Linear Classification via Gradient Flow

Yang, Hongru; Wang, Zhangyang; Lee, Jason D; Liang, Yingbin (April 2025, Neural Information Processing Systems (NeurIPS))

Full Text Available
Transformers Provably Learn Two-Mixture of Linear Classification via Gradient Flow

Yang, Hongru; Wang, Zhangyang; Lee, Jason D; Liang, Yingbin (April 2025, International Conference on Learning Representations (ICLR))

Full Text Available
Random Pruning Over-parameterized Neural Networks Can Improve Generalization: A Training Dynamics Analysis

Yang, Hongru; Liang, Yingbin; Guo, Xiaojie; Wu, Lingfei; Wang, Zhangyang (April 2025, Journal of Machine Learning Research)

Full Text Available
Random Pruning Over-parameterized Neural Networks Can Improve Generalization: A Training Dynamics Analysis

Yang, Hongru; Liang, Yingbin; Guo, Xiaojie; Wu, Lingfei; Wang, Zhangyang (April 2025, Journal of Machine Learning Research (JMLR))

Full Text Available
Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis

Yang, Hongru; Kailkhura, Bhavya; Wang, Zhangyang; Liang, Yingbin (December 2024, Conference on Neural Information Processing Systems (NeurIPS 2024))

Understanding the training dynamics of transformers is important for explaining the impressive capabilities of large language models. In this work, we study the dynamics of training a shallow transformer on the task of recognizing the co-occurrence of two designated words. Prior work on the training dynamics of transformers commonly adopts several simplifications, such as weight reparameterization, attention linearization, special initialization, and the lazy regime. In contrast, we analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer from random initialization, and provide a framework for analyzing such dynamics via a coupled dynamical system. We establish that training reaches a near-minimum loss and characterize the attention model after training. We discover that gradient flow serves as an inherent mechanism that naturally divides the training process into two phases. In Phase 1, the linear MLP quickly aligns with the two target signals for correct classification, whereas the softmax attention remains almost unchanged. In Phase 2, the attention matrices and the MLP evolve jointly to enlarge the classification margin and reduce the loss to a near-minimum value. Technically, we prove a novel property of the gradient flow, termed automatic balancing of gradients, which enables the loss values of different samples to decrease at almost the same rate and further facilitates the proof of near-minimum training loss. We also conduct experiments to verify our theoretical results.
Full Text Available
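As a rough illustration of the setup this abstract describes, the sketch below runs the forward pass of a single softmax-attention layer with three attention matrices followed by a linear MLP, and labels a sequence by whether two designated words co-occur. It is not taken from the paper: the one-hot embedding, the mean pooling, the initialization scale, and every name (`forward`, `cooccurrence_label`, `VOCAB`, `DIM`) are assumptions chosen for the example.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def one_hot(tokens, vocab):
    # Assumption: tokens are embedded as one-hot vectors.
    return [[1.0 if j == t else 0.0 for j in range(vocab)] for t in tokens]

def cooccurrence_label(tokens, word_a, word_b):
    """+1 if both designated words appear in the sequence, otherwise -1."""
    return 1 if (word_a in tokens and word_b in tokens) else -1

def forward(tokens, wq, wk, wv, mlp, vocab):
    x = one_hot(tokens, vocab)                                # L x vocab
    scores = matmul(matmul(x, wq), transpose(matmul(x, wk)))  # L x L
    attn = softmax_rows(scores)                               # rows sum to 1
    hidden = matmul(attn, matmul(x, wv))                      # L x DIM
    # Assumption: mean-pool over positions before the linear MLP.
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    logit = sum(p * m for p, m in zip(pooled, mlp))
    return logit, attn

def random_matrix(rows, cols, scale, rng):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
VOCAB, DIM = 6, 4  # hypothetical sizes
wq = random_matrix(VOCAB, DIM, 0.01, rng)
wk = random_matrix(VOCAB, DIM, 0.01, rng)
wv = random_matrix(VOCAB, DIM, 0.01, rng)
mlp = [rng.gauss(0.0, 0.01) for _ in range(DIM)]

tokens = [0, 3, 1, 5]                    # contains both designated words 0 and 1
logit, attn = forward(tokens, wq, wk, wv, mlp, VOCAB)
label = cooccurrence_label(tokens, 0, 1)
```

With the small random initialization assumed here, the attention scores are close to zero, so each row of `attn` is close to uniform over the four positions.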
Training dynamics of transformers to recognize word co-occurrence via gradient flow analysis

Yang, Hongru; Kailkhura, Bhavya; Wang, Zhangyang; Liang, Yingbin (December 2024, Advances in Neural Information Processing Systems (NeurIPS))

Full Text Available
Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis

Yang, Hongru; Kailkhura, Bhavya; Wang, Zhangyang; Liang, Yingbin (November 2024, Neural Information Processing Systems (NeurIPS))

Full Text Available
Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK

Yang, Hongru; Jiang, Ziyu; Zhang, Ruizhe; Liang, Yingbin; Wang, Zhangyang (November 2024, Journal of Machine Learning Research)

We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime, where the networks' biases are initialized to some constant rather than zero. We prove that under such initialization, the neural network has sparse activation throughout the entire training process, which enables fast training procedures via some sophisticated computational methods. With such initialization, we show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK, and we study various properties of the neural networks with this new kernel. First, we characterize the gradient descent dynamics. In particular, we show that the network in this case can converge as fast as the dense network, in contrast to previous work suggesting that sparse networks converge more slowly. In addition, our result improves the previously required width to ensure convergence. Second, we study the networks' generalization: we show a width-sparsity dependence, which yields a sparsity-dependent Rademacher complexity and generalization bound. To our knowledge, this is the first sparsity-dependent generalization result via Rademacher complexity. Lastly, we study the smallest eigenvalue of this new kernel. We identify a data-dependent region where we can derive a much sharper lower bound on the NTK's smallest eigenvalue than the previously known worst-case bound. This can lead to an improvement in the generalization bound.
Full Text Available
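The sparse-activation effect mentioned in this abstract can be seen at initialization with a small simulation. The sketch below is not the paper's construction: it assumes Gaussian hidden weights, a unit-norm input, and a pre-activation of the form w.x + bias with a negative constant bias, and the function names are hypothetical. Under those assumptions w.x is standard normal, so the fraction of active ReLU units should be close to the Gaussian tail probability P(Z > -bias).

```python
import math
import random

def active_fraction(bias, width=4000, dim=20, seed=0):
    """Fraction of hidden ReLU units with positive pre-activation at initialization."""
    rng = random.Random(seed)
    # Assumption: a unit-norm input, so that w.x is standard normal.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    active = 0
    for _ in range(width):
        # Assumption: hidden weights drawn i.i.d. from N(0, 1).
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        pre = sum(wi * xi for wi, xi in zip(w, x)) + bias
        if pre > 0:
            active += 1
    return active / width

def gaussian_tail(b):
    """P(Z > b) for a standard normal Z."""
    return 0.5 * math.erfc(b / math.sqrt(2.0))

dense = active_fraction(0.0)     # zero bias: about half of the units fire
sparse = active_fraction(-2.0)   # constant bias of -2: only a small fraction fires
expected_sparse = gaussian_tail(2.0)
```

Comparing `sparse` with `expected_sparse` checks the simulation against the closed-form tail probability. The example only covers initialization and says nothing about sparsity during training, which is what the paper proves.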
